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Abstract 



We consider the prediction problem of a continuous-time stochastic process on an entire 
time-interval in terms of its recent past. The approach we adopt is based on functional kernel 
nonparametric regression estimation techniques where observations are segments of the observed 
process considered as curves. These curves are assumed to lie within a space of possibly 
inhomogeneous functions, and the discretized times series dataset consists of a relatively small, 
compared to the number of segments, number of measurements made at regular times. We 
thus consider only the case where an asymptotically non-increasing number of measurements 
is available for each portion of the times series. We estimate conditional expectations using 
appropriate wavelet decompositions of the segmented sample paths. A notion of similarity, based 
on wavelet decompositions, is used in order to calibrate the prediction. Asymptotic properties 
when the number of segments grows to infinity arc investigated under mild conditions, and a 
nonparametric resampling procedure is used to generate, in a flexible way, valid asymptotic 
pointwisc confidence intervals for the predicted trajectories. We illustrate the usefulness of the 
proposed functional wavelet-kernel methodology in finite sample situations by means of three 
real-life datasets that were collected from different arenas. 

Some key words: a-MixiNG; Besov Spaces; Continuous-Time Prediction; 
Functional Kernel Regression; Pointwise Prediction Intervals; Resampling; 
SARIMA Models; Smoothing Splines; Wavelets 

1 INTRODUCTION 

In many real life situations one seeks information on the evolution of a (real-valued) continuous- 
time stochastic process X = (X(t); t € R) in the future. Given a trajectory of X observed on the 
interval [0, T], one would like to predict the behavior of X on the entire interval [T,T + 8], where 
5 > 0, rather than at specific time-points. An appropriate approach to this problem is to divide 
the interval [0, T] into subintervals [IS, (I + 1)8], I = 0, 1, . . . , i — 1 with 8 = T/i, and to consider the 
stochastic process Z = {Z; L ; i € N + ), where N + = {1, 2, . . .}, defined by 

Zi{t) = X{t+{i-l)8), iGN + , Vie [0,8). (1) 

This representation is especially fruitful if X has a seasonal component with period 8 and can 
be decomposed into locally stationary parts. It can be also employed if the data are collected as 
curves indexed by time-intervals of equal lengths; these intervals may be adjacent, disjoint or even 
overlapping (see, for example, Ramsay & Silverman, 1997). 

In the recent literature, practically all investigations to date for this prediction problem are for 
the case, where one assumes that an appropriately centered version of the stochastic process Z is a 
(zero-mean) Hilbert- valued autoregressive (of order 1) processes (ARH(l)); the best prediction of 
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Z n+ i given its past history (Z n , Z n -\, ... ,Z\) is then given by 

Z n +i = E(Z n +i | Z n , Z n -x, . . . , Z{) 
= p(Z n ), n € N + , 

where p is a bounded linear operator associated with the ARH(l) process. The adopted approaches 
mainly differ in the way of estimating the 'prediction' operator p, or its value p(Z n ) given 
Zi, Z2, ■ ■ ■ , Z n (see, e.g., Bosq, 1991; Besse &: Cardot, 1996; Bosq, 2000; Antoniadis & Sapatinas, 
2003). 

In many practical situations, however, the stochastic process Z may not have smooth sample 
paths (lying in H ) or may not be modelled with such an autoregressive structure. This is the case 
that we consider in the following development. In particular, we also assume that the (real- valued) 
continuous-time stochastic process X = (X(t); t £ W) possesses a representation of the form Q 
with short duration 'blocks' Zi, for i 6 N + . We then develop a version of prediction via functional 
regression analysis, in which both the predictor and response variables are functions of time, using a 
conditioning idea. Under mild assumptions on the observed time series, a one time-interval ahead 
prediction of the 'block' Z n+ \ is obtained by kernel regression of the present 'block' Z n on the 
past 'blocks' {Z n _i, Z n _2, • • • , Z{\. The resulting predictor will be seen as a weighted average of 
the past 'blocks', placing more weight on 'blocks' that are similar to the present one. Hence, the 
analysis is rooted in the ability to find 'similar blocks'. Considering that 'blocks' can be quite 
irregular curves, similarity matching is based on a distance metric on the wavelet coefficients of 
an appropriate wavelet decomposition of the 'blocks'. A resampling scheme, involving resampling 
of the original 'blocks' to form 'pseudo-blocks' of the same duration, is then used to calculate 
pointwise prediction intervals for the predicted 'block'. 

The paper is organized as follows. In Section 2, we introduce basic notions for continuous-time 
prediction, relevant notions on wavelet-based orthogonal expansions of continuous-time stochastic 
processes, and describe the strictly stationarity and a-mixing assumptions that are going to be 
adopted for forecasting. In Section 3, we discuss the extension of the conditioning approach to the 
one time-interval ahead prediction. Resampling-based pointwise prediction intervals are presented 
in Section 4. In Section 5, we illustrate the usefulness of the proposed functional wavelet-kernel 
approach for continuous-time prediction by means of three real-life datasets that were collected 
from different arenas. We also compare the resulting predictions with those obtained by two other 
methods in the literature, in particular with a smoothing spline method and with the SARIMA 
model. Some concluding remarks are made in Section 6. Proofs and auxiliary results are compiled 
in the Appendix. 
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2 PRELIMINARIES AND NOTATIONS 



2.1 Generalities 

Let (0,«4,P) be a probability space, rich enough so that all random variables considered in the 
following development can be defined on this space, and let X = (X(t); f S M) be a (real- valued) 
continuous-time stochastic process defined on this space. Motivated by applications to prediction 
and forecasting, it is supposed that the time-interval on which the continuous-time stochastic 
process is observed is divided into intervals of constant- width 5 > so that, from X, a functional- 
valued random variable sequence (Zf, i £ N + ) is constructed according to the representation 
i.e., 

Zi(t) =X(t + (i-l)5), i G N + , Vie [0,<5). 

In the sequel, we regard the Z,'s as elements of a certain (semi-)normed functional linear space 
H equipped with (semi-)norm || • || and its Borel cr-field Recall that our aim is a one-ahead 

time interval prediction which, under the above notation, is reduced in studying a corresponding 
problem for the H- valued time series Z = (Zf, i € N + ). In what follows, we assume that the time 
series Z is strictly stationary with E(||Zj||) < oo. If the time series Z is not stationary, it is assumed 
that it has been transformed to a stationary one by a preprocessing procedure, and the procedures 
that we are going to develop hold for the resulting stationary time series. 

In practice, the random curves Zj are only known at discretized equidistant time values, say 
t\, . . . ,tp. Thus, they must be approximated by some -ff-valued functions, which in our case will 
be realized by first expanding the Z^s into a wavelet basis and then estimating consistently the 
coefficients from the observed discrete data. We may have used a fixed spline basis or a Fourier 
basis instead but there are some good reasons to prefer wavelet bases. When P is fixed, using spline 
interpolation could make sense if the sample paths exhibits a uniformly smooth temporal structure 
thus not requiring any smoothing to stabilize the variance. When P is large, a necessary smoothing 
step in the spline approximation would be necessary and since the smoothing is global this would 
not be appropriate when the sample paths are composed of different temporal structures. The 
same remarks apply for a Fourier basis. On the contrary, when P is fixed and the sample paths 
are continuous, one may choose an interpolation wavelet basis and the wavelets coefficients are 
computed directly from the sampled values instead of inner product integrals. When Zi are observed 
either continuously or on a very fine discretization grid, then wavelets can be used successfully for 
compression of a continuous-time stochastic process, in the sense that the sample paths can be 
accurately reconstructed from a fraction of the full set of wavelet coefficients. Whatever the setting 
is, the wavelet decomposition of the sample paths will be a local one, so that if the information 
relevant to our prediction problem is contained in a particular part or parts of the sample path, 
as it is typically the case in many practical applications, this information will be carried by a very 
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small number of wavelet coefficients. 

Below, we recall some background on interpolating wavelets and orthonormal wavelet expansions 
of continuous-time stochastic processes that we are going to use in the subsequent development. 



2.2 Orthonormal wavelet expansions of continuous-time stochastic processes 

The discrete wavelet transform (DWT), as formulated by Mallat (1989) and Daubechies (1992), is 
an increasingly popular tool for the statistical analysis of time series (see, e.g., Nason & von Sachs, 
1999; Percival & Walden, 2000; Fryzlewicz, Van Bellegem & von Sachs, 2003). The DWT maps 
a time series into a set of wavelet coefficients, each one associated with a particular scale. Two 
distinct wavelet coefficients can be either 'within-scale' (i.e., both are associated with the same 
scale) or 'between-scale' (i.e., each has a distinct scale). 

One reason for the popularity of the DWT in times series analysis is that measured data from 
most processes are inherently multiscale in nature, owing to contributions from events occurring 
at different locations and with different localization in time and frequency. Consequently, data 
analysis and modelling methods that represent the measured variables at multiple scales are better 
suited for extracting information from measured data than methods that represent the variables at 
a single scale. 

Let S(t) = S(u,t) be a mean-square continuous-time stochastic process defined on f2 x [0,1], 

i.e. 



S € L 2 (n x [0, 1]) = <^ X(t) : Q -> R, t G [0, 1] 



Recall that (see, e.g., Neveu, 1968) L 2 (tt x [0,1]) is a separable Hilbert space equipped with inner 
product defined by 



(X 1 ,X 2 ) =E^j\ 1 (t)X 2 (t)dt 



To develop a wavelet-based orthonormal expansion, we mimic the construction of a (regular) 
multiresolution analysis of L 2 ([0, 1]) (see, e.g., Mallat, 1989). In other words, consider a ladder 
of closed subspaces 

y J0 cy JO+1 c---cL 2 ([o,i]), 

with any fixed jo > 0, whose union is dense in L 2 ([0, 1]), and where, for each j, Vj is spanned by 2 J 
orthonormal scaling functions {4>j,k '■ k = 0, . . . , 2 J ' — 1}, such that supp(0 J; fc) C [2 -J '(A; — c), 2~i (k + 
c)], with c a constant not depending on j. At each resolution level j, the orthonormal complement 
Wj between Vj and Vj+i is generated by 2 J orthonormal wavelets {4>j t k '■ k = 0, . . . , 2 J ' — 1}, such that 
supp(V'j > fc) C [2 -J '(A; — c), 2 _j, (/c + c)]. As a consequence, the family ^j>j {^j,k '■ k = 0, . . . , 2 J ' — 1}, 
completed with {4>j (h k '■ k = 0, . . . ,2?° — 1}, constitutes an orthonormal basis of L 2 ([0, 1]). 
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Similarly, we define a sequence of approximating spaces of L 2 (Q, x [0, 1]) by 



Vj(n x [0,1]) = < 



X G L 2 (n x [0,1]) 



2^-1 



2»-l 



*w = E E E &.*) 2 < 00 



fc=0 



fc=0 



where : /c = 0, . . . , 2 J — 1} is a sequence of random variables and 4>j^ is the scaling basis of 
Vj. Note that since L 2 (Q, x [0,1]) is isomorphic to the Hilbert tensor product L 2 (f2) ® L 2 ([0, 1]), 
the stochastic approximating spaces 1^ (0 x [0, 1]) are closed subspaces of L 2 (VL x [0, 1]). Note also 
that for every X G Vj(Sl x [0, 1]), one has E(X) G L 2 ([0, 1]) since 

2 



/ 



[E(X(t))] 2 dt 



E fj.fc^.fcW 

fc=0 



2^-1 



2^-1 



= e pEfo*)] 2 < E E &.*) 2 , 



fc=0 



fc=0 



by orthonormality of the scaling functions. Following Cohen & D'Ales (1997), it is easy to see that 
{Vj(Q x [0, 1]) : j G N } is a multiresolution analysis of L 2 (£l x [0, 1]). Moreover if Wj(Q x [0, 1]) 
denotes the orthonormal complement of Vj(Q x [0, 1]) in V} + i(Q x [0, 1]), then one naturally has 
the following stochastic wavelet orthonormal expansion 

2*>-l oo 2^-1 

X G L 2 (n x [0, 1]) w X(t) = ^ ^o,fc^o,fc(*) + E E 

fc=o i=io k=o 

where = Jq c/)j : h(t)X(t) dt and rjj^ = i/}j^(t)X(t) dt. The above remarks clearly show that 
the wavelet-based orthonormal expansion is a fundamental tool for viewing the continuous-time 
stochastic process in both time and scale domains. 

In order to allow for inhomogeneous sample paths, the notion of the Besov space (-B| ? ) comes 
naturally into the picture. Without getting into mathematical details, we just point out that Besov 
spaces are known to have exceptional expressive power: for particular choices of the parameters s, 
p and q, they include, e.g., the traditional Holder (p = q = oo) and Sobolev (p = q = 2) classes 
of smooth functions, and the inhomogeneous functions of bounded variation sandwiched between 
B\ 00 and B\]_. The parameter p can be viewed as a degree of function's inhomogeneity while 
s is a measure of its smoothness. Roughly speaking, the (not necessarily integer) parameter s 
indicates the number of function's (fractional) derivatives, where their existence is required in an 
L p -sense, while the additional parameter q provides a further finer gradation (see, e.g.,. Meyer, 
1992). Therefore, from an approximation perspective, if the sample paths of the continuous-time 
stochastic process X belong to an inhomogeneous Besov space of regularity s > 0, and if one 
uses regular enough scaling functions, one may approximate in L 2 (J7 x [0, 1]) any sample path of 
the process X by its projection onto Vj at a rate of the order 0(2~ sJ ) (see Cohen & D'Ales, 
1997, Theorem 2.1). This is, in fact, a simple rephrasing, in the stochastic framework, of the 
deterministic results on the multiresolution approximation of functions in Besov spaces when the 
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error is measured in the L 2 ([0, l])-norm. This result has the advantage that dimension reduction 
by basis truncation in the wavelet domain is controlled more precisely. 

Consider now the case where the stochastic signal is sampled over a finite number P of 
equidistant points and assume that P = 2 J . One then may use an interpolating wavelet basis, 
as the one constructed by Donoho (1992), to interpolate the observed values. The interpolating 
scaling function (p corresponds to the autocorrelation function of an orthogonal, regular enough, 
scaling function and the projection onto the scaling approximating space Vj is then given by 

2 J -1 

Vj(Z)(t)= Y,Z(t k )<l>(2 J t-k). 

k=0 

While this projector has no orthogonality property, it retains however the good approximations 
properties of projection operators derived from orthogonal multiresolution analyses (see Mallat, 
1999, Theorem 7.21). When P is not anymore a power of two, one may still use the above scheme 
by adapting the interpolating wavelet basis to the sampling grid using an appropriate subdivision 
scheme for interpolation (see Cohen, Dyn & Matei, 2003). 

We conclude this section by recalling the strictly stationarity and a-mixing concepts that we 
are going to adopt for developing the proposed functional wavelet-kernel prediction methodology. 

2.3 Strictly stationarity and a-mixing 

One of our main assumption in predicting the times series Z will be its strictly stationarity. 
We therefore recall some results on strictly stationarity from the above stochastic multiresolution 
analysis perspective. Recall that, for all X € Vj(p, x [0, 1]), we have 



k=0 

Therefore, 

2 J -1 2 J -1 

X(t + 8) = £ 6,fc0J,fc(i + s) = Uk2 J/2 <P(2 J (t + s)-k) 

k=0 k=0 
2 J -l 2 J -1 

= Y, Uk2 J/2 <P(2 J t - (k - 2 J s)) = Y Sj,k+2's<t>J,k(t)- 

k=0 k=0 

iFrom the above, it is easy to see that if X is a strictly stationary process then, at any resolution 
level j, the vector of its scaling coefficients, is also strictly stationary. As shown in Cheng & Tong 
(1996), the converse is also true. It follows that strictly stationarity of the discrete-time series Z 
implies strictly stationarity in the above sense of the sequence of the scaling coefficients vectors 
at any resolution. Note, moreover, that if the strictly stationarity assumption is too strong, one 
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could calibrate the non-stationarity by considering only J-stationarity (see Cheng & Tong, 1998), 
that is, strictly stationarity of the scaling coefficients up to (finest) resolution J, with eventually a 
different distribution at each resolution level j. 

Our theoretical results will be derived under a-mixing assumptions on the time series Z = 
(Zf, i G N + ). Recall that for a strictly stationary series Z = (Zf, i G N + ), the a-mixing coefficient 
(see Rosenblatt, 1956) is defined by 

a z (m) = sup \F(AnB) -¥(A)¥(B)\, 
Aev u Bev l+m 

where T>i = a(Zi,i < I) and T>i +m = a(Zi,i >l + m) are the a-fields generated by (Zf, i < I) and 
(Zf, i > I + m) respectively, for any m > 1. The stationary sequence Z = (Zf, i G N + ) is said to 
be a-mixing if az(m) — > as m — > oo. Among various mixing conditions used in the literature, 
a-mixing is reasonably weak (see, e.g., Doukhan, 1994). 

Since in the subsequent development we are dealing with wavelet decompositions, for each 
i G N + , denote by Sj = J ' k " > : k = 0, 1, . . . , 2 J — 1} the set of scaling coefficients up to resolution 
level J of the i-th segment Z{. Note that because Z = (Zf, i G N + ) is a strictly stationary 
stochastic process, the same is also true for the 2 J -dimensional stochastic process {Ef i G N + }. 
Furthermore, denote by Ajj = a(^f k ,i < I) and Ajj +m = a(^ ,k ,i > l + m) the cr-fields generated 
by (i\ J,kS> ; i < I) and (^f^; i>l + m) respectively. Because (r(^\' J ' k \i G I) C a(Zi,i G /) for any 
/ C N+, we get 

a J:k (m) = sup \¥(AnB) -¥(A)¥(B)\ 

AeAj,i,BeAj,i +m 

< sup \F(AC\B) -F(A)F(B)\ 

AeVjj,Bev Jil+m 

= a z (m). 

Note that the above observation is also true when dealing with sample paths discretized over a 
fixed equidistant grid on [0, S] of size P, since then £p } = Zi(t k ) for all k = 1, 2, . . . , P. 

3 A Functional Wavelet- kernel prediction 

3.1 Finite-dimensional Kernel Prediction 

Consider the nonparametric prediction of a (real- valued) stationary discrete-time stochastic process. 
Let X n ^ = (X n , X n _i, . . . , J„_ r+ i) G W be the vector of lagged variables, and let s be the forecast 
horizon. It is well-known that the autoregression function plays a predominant forecasting role in 
the above time series context. Recall that the autoregression function / is defined by 

/(x) = E(X n+s | X n>{r) = x), x G R r . 
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It is clear that the first task is to estimate /. The classical approach to this problem is to find some 
parametric estimate of /. More specifically, it is assumed that / belongs to a class of functions, 
only depending on a finite number of parameters to be estimated. This is the case of the very well- 
known ARIMA models, widely studied in the literature (see, e.g., Box & Jenkins, 1976; Brockwell 
& Davis, 1991). 

The above prediction problem can also be undertaken with a nonparametric view, without any 
assumption on the functional form of /. This is a much more flexible approach that only imposes 
regularity conditions on /. Nonparametric methods for forecasting in time series can be viewed, up 
to a certain extent, as a particular case of nonparametric regression estimation under dependence 
(see, e.g., Bosq, 1991; Hart, 1991; Hardle & Vieu, 1992; Hart, 1996). A popular nonparametric 
method for such task is to use the kernel smoothing ideas because they have good properties in 
real-valued regression problems both from a theoretical and a computational point of view. The 
kernel estimator /„ (based on Xi, . . . , X n ) of / is defined by 

f ( \ Z:= r S n(x-Xn,(r))/K)X t+s 
Et=r S n(x-Xn,(r))/h n ) ' 

or if the denominator is null. In our development, for simplicity, we consider a product kernel, 
i.e., for each x = (x±, . . . ,x r ), K(x) = Y]a=i K{ x i)\ a l so h n is a sequence of positive numbers (the 
bandwidths). The s-ahead prediction is then simply given by X n+S \ n = f n (X n ^). Theoretical 
results show that the detailed choice of the kernel function does not influence strongly the 
asymptotic behavior of prediction but the choice of the bandwidth values are crucial for prediction 
accuracy (see, e.g., Bosq, 1998). 

As it is readily seen, the prediction is expressed as a locally weighted mean of past values, where 
the weights measure the similarity between (X t ( r y, t = r, . . . ,n — s) and X n ^, taking into account 
the process history. Let now || • || be a generic notation for a Euclidean norm. If the kernel values 
decrease to zero as ||x|| increases, the smoothing weights have high values when the (X t ^) is close 
to A Ttj ( r ), and is close to zero otherwise. In other words, the prediction X n+S \ n is obtained as a 
locally weighted average of future blocks of horizon s in all blocks of length r in the past, weighted 
by similarity coefficients w U: t of these blocks with the current block, where 

K((x-X ti ( r) )/fen) 

3.2 Functional Wavelet-Kernel Prediction 

Recall that, in our setting, the strictly stationary time series Z = (Zf, i € N + ) is functional-valued 
rather than R- valued, i.e., each Z; L is a random curve. In this functional setup, and to simplify 
notation, we address, without loss of generality, the prediction problem for a horizon s = 1. We 
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could mimic the above kernel regression ideas, and use the following estimate 

n-l 

Z n+l\n{-) = ^ Wn,m,Zm+l(-), (2) 
m=l 

where the triangular-array of local weights {w n>m : m = 1, 2, . . . , n — 1; n G N + } increases with the 
closeness or similarity of the last observed path Z n and the paths Z m in the past, in a (semi-)norm 
sense; this is made more precise in © below. The literature on this infinite-dimension kernel 
regression related topic is relatively limited, to our knowledge. Bosq & Delecroix (1985) dealt 
with general kernel predictors for Hilbert- valued stationary Markovian processes. A similar idea 
was applied by Besse, Cardot & Stephenson (1999) for ARH(l) processes in the special case of a 
Sobolev space. Extending and justifying these kernel regression techniques to infinite-dimensional 
stochastic processes with no specific structures (e.g., ARH(l) or more general Markovian processes), 
will require using measure-theoretic assumptions on infinite-dimensional spaces (e.g., a probability 
density function with respect to an invariant measure) thus restricting the analysis and applicability 
to a small class of stochastic processes (e.g., diffusion processes). Such kind of assumptions are 
more natural in finite-dimensional spaces such as those obtained through orthonormal wavelet 
decompositions, especially when the discretized sample paths of the observed process are quite 
coarse. Taking advantage of these latter remarks, our forecasting methodology relies upon a wavelet 
decomposition of the observed curves, and uses the concepts of strictly stationarity and a-mixing 
within the stochastic multiresolution analysis framework discussed in Section 2. Moreover, note 
that using distributional assumptions such as those given in the appendix on the wavelet coefficients 
is much less restrictive than using similar assumptions on the original process Z. 

To summarize, the proposed forecasting methodology is decomposed into two phases 

Phi: find within the past paths the ones that are 'similar' to the last observed path (this determines 
the weights); 

Ph2: use the weights and the stochastic multiresoltion analysis to forecast by a locally weighted 
averaging process as the one described by ©. 

Since we are dealing with a wavelet decomposition, it is worth to isolate the first phase 
(Phi) by discussing possible ways to measure the similarity of two curves, by means of their 
wavelet approximation, and then to proceed to the second phase (Ph2), using again this wavelet 
approximation. The analysis of the proposed kernel-based functional prediction method is based on 
finding similar paths. Similarity is now defined in terms of a distance metric related to the functional 
space in which the sample paths lie. When the space is a Besov space, it is well-known that its 
norm is characterized by a weighted £ p -norm of the wavelet coefficients of its elements (see, e.g., 
Meyer, 1992). It is therefore natural to address the similarity issue on the wavelet decomposition 
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of the observed sample paths. The wavelet transform is applied to the observed sample paths, 
and due to the approximation properties of the wavelet transform, only a few coefficients of the 
transformed data will be used; a kind of contractive property of the wavelet transform. 

Applying the DWT to each path, decomposes the temporal information of the time series into 
discrete wavelet coefficients associated with both time and scale. Discarding scales in the DWT that 
are associated with high-frequency oscillations, provides a straightforward data reduction step and 

decreases the computational burden. We want to use the distributional properties of the wavelet 

(i) 

coefficients of the observed series. Imagine first that we are given 2 observed series, and let 9 - 
i = 1, 2, be the discrete wavelet coefficient of the DWT of each signal at scale j (j = jo, . . . , J — 1) 
and location k (k = 0, 1, . . . , 2? — 1). At each scale j > jo, define a measure of discrepancy in terms 
of a distance 

'v-i 



1/2 



k=0 

which measures how effectively the two signals match at scale j. To combine all scales, we then use 

3=30 

Such a measure of discrepancy is natural and is often used to test the equality of two regression 
functions in the wavelet domain (see, e.g., Spokoiny, 1996; Abramovich, Antoniadis, Sapatinas & 
Vidakovic, 2004). 

As for the second phase (Ph2), recall that, for each i £ N + , E$ = ■ k = 0, 1, . . . , 2 J — 1} 

denotes the set of scaling coefficients up to resolution level J of the i-th segment Z{. The kernel 
prediction of the scaling coefficients at time n + 1, H n+1 n , is given by 

J2 n m =iK(D(.C(E n ),C(^ m ))/h n )E m+1 



J )t+1 n 



(3) 



Vn + ErJi K(D(C(Z n ),C(E m ))/h n ) ' 

where the 1 jn factor in the denominator allows expression to be properly defined and does not 
affect its convergence rate. Here, for simplicity, we use the notation D(x,y)/h n = D(x/h n ,y/h n ), 
and C(Efc) is the set of wavelet coefficients obtained by applying the "pyramid algorithm" (see 
Mallat, 1989) on the set of (finest level) scaling coefficients H^, for k = 1, 2, . . . , n. This, leads to 
the time-domain prediction at time n + 1, 

2 J -1 

3h-i !„(*)= E&iin'M*), Vte[0,*), 



k=0 



of K(Z n+ i(-) | Z n (-)), where £i+i^| n are the components of the predicted scaling coefficients H n+1 | ra . 
The following theorem shows its consistency property. 
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Theorem 3.1 Suppose that the Assumptions (A1)-(A7), given in the Appendix, are true. 

(i) If t € {0, 1, . . . , 2 J — 1} and if h n = ( J , then, as n — > oo, we have 

Z J n+l]n (t) - E{Z n+1 (t)\Z n ) =0 2 J / 2 I, almost surely. 

(ii) If the sample paths are sampled on a grid of size 2 and if h n = I J , then, 
as n — > oo ; we /iGroe 

/ / 1q n l/(2+2 J ) \ 

Z n J +1|n (t) - E(Z n+1 (t)|Z n ) =0 2 J / 2 ^ +2" sJ , almost surely. 



Remark 3.1 Note that in both assertions of Theorem 13. 11 the size of the sampling grid over each 
segment affects the convergence rate of our predictor. In the first case, which is the most usual 
in practice, the rate becomes slower as the dimension P increases but we still have consistency 
as the number of segments increases to infinity. In the second case, however, an extra term is 
given by the wavelet approximation of the sample paths at resolution J and getting a larger J 
affects considerably the rate of the estimator which is the well-known problem of the "curse of 
dimensionality". One possible way to deal with this problem would be to look at the regressor 
in an infinite-dimensional space but, as already noted above, this is a difficult problem, since one 
would need some concentration assumption about the distribution of the functional-valued time 
series Z = i £ N + ) without referring to any particular probability density function. 

We conclude this section by pointing out that, as in any nonparametric smoothing approach, 
the choice of the smoothing parameter h n (the bandwidth) is of great importance. Once h n is 
specified, only time segments that lie within a similarity distance from the segment Z n within h n 
will be used to estimate the prediction at time n + 1. Intuitively, a large value of h n will lead to 
an estimator that incurs large bias, while a small value, might reduce the bias but the variability 
of the predicted curve could be large since only few segments are used in the estimation. A good 
choice of h n should balance the bias-variance trade off. In our implementation, we have used the 
leave-one out cross-validation for times-series data as suggested by Hart (1996). The principle of 
the cross-validation criterion is to select the bandwidth which, for our given prediction horizon 
s = 1, minimizes the mean squared prediction errors of the (i + l)-th segment using all segments 
in the past except the i-th, i.e., the value of h n that minimizes 

_. n— 1 

°nh)=— 1 j:\\z i+1 -zv lli 
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where Z A , . is the kernel regression estimate with bandwidth h obtained using the series without 
its i-th segment. This is the method for choosing the bandwidth adopted in the numerical results 
presented in Sectional 

4 Resampling-based Pointwise Prediction Intervals 

Apart from the prediction Z J n+l y n {t) discussed in Section |31 we also construct resampling-based 
pointwise prediction intervals for Z n+ \(t). In particular, suppose that Z n (t) is observed at the set 
< t\ < ti < . . . < tp < 5 of discrete points on the interval [0, 5]. A pointwise prediction interval 
for Z n+ \{t) is defined to be a set of lower and upper points L n+ i tOC (ti) and J7 n +i,a(ii) respectively, 
such that for every ti, i = 1, 2, . . . , P, and every a 6 (0, 1), 

F(L n+1>a (U) < Z n+1 (U) < U n+1>a (ti)) > 1 - 2a. 

Note that since we are looking at the one step prediction of Z n+ i(t) given Z n , we are in fact 
interested in the conditional distribution of Z n+ i(t) given Z n , i.e., L n+ i :Ct {ti) and U n+ i^ a {ti) are 
the lower and upper a-percentage points of the conditional distribution of Z n+ \{ti) given Z n . 

To construct such a prediction interval the following simple resampling procedure is proposed. 
Given Z n , i.e., given C(E n ), define the weights 

K(D(C(E n ),C(E m ))/h n ) 1 

^ n- 1 + Erii K(D(C(E n ),C(E m ))/h n ) 1 +n^=\ K(D(C(E n ),C(E m ))/h n )) ' 
Note that the weights have been selected appropriately so that 

n-l 

< w n>m < 1 and ^ w n ^ m = 1. 

m=l 

Now, given Z n , generate pseudo-realizations Z* +1 (t) such that for m = 1, 2, . . . , n — 1, 

P (Z* +1 (t) = Z m+ i(t) | Z n ) = w n , m , 

i.e., Z* +1 (t) is generated by choosing randomly a segment from the whole set of observed segments 
Z m+ i(t), where the probability that the (m + l)-th segment is chosen depends on how 'similar' is 
the preceding segment Z m to Z n . This 'similarity' is measured by the resampling probability w n ^ m - 

Given pseudo-replicates Z* +1 (t), calculate R* n+1 {ti) = Z* +1 (ti) - ^ +1 | n (<i), where 
is our time-domain conditional mean predictor. Let a (ti) and R^+i \- a (ti) be the lower and 
upper a-percentage points of Note that these percentage points can be consistently 

estimated by the corresponding empirical percentage points over B realizations Z*^^), b = 
1,2, ... ,5, of Z* +1 (ti). A (1 — a)100% pointwise prediction interval for Z n+ i(ti) is then obtained 
by 

{[ L* n+l!a (U), U* +1)a (ti)}, i = l,2,...,P}, 
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where L* n+ ^ a ( ti ) = R* n+1 ^(t t ) + ^ +1|n &) and U* +ha (t t ) = R* n+1A _ a (t t ) + ^ +1|n (^)- 

The following theorem shows that the proposed method is asymptotically valid, i.e., the 
so-constructed resampling-based prediction interval achieves the desired pointwise coverage 
probability. 

Theorem 4.1 Suppose that the Assumptions (A1)-(A7), given in the Appendix, are true. Then, 
for every i = 1, 2, . . . , P, and every a G (0, 1) 

lim P (L* n+lta (ti) < Z n+1 (U) < U* +ha (U) | Z ± , . . . , Z n ) > 1 - 2a. 

5 Applications to real-life datasets 

We now illustrate the usefulness of the proposed functional wavelet-kernel (W-K) approach for 
continuous-time prediction in finite sample situations by means of three real-life datasets that were 
collected from different arenas, in particular with (a) the prediction of the entire annual cycle of 
climatological El Nino-Southern Oscillation time series one-year ahead from monthly recordings, (b) 
the one-week ahead prediction of Paris electrical load consumption from half-hour daily recordings, 
and (c) the one-year ahead prediction of the Nottingham temperature data from monthly recordings. 
For the W-K approach, the interpolating wavelet transform of Donoho (1992) based on Symmlet 

6 (see Daubechies, 1992, p. 195) was used. Preliminary simulations show that the analysis is robust 
with respect to the wavelet filter, e.g., using Coiflet 3 (see Daubechies, 1992, p. 258). In the case 
where the number of time points (P) in each segment is not a power of 2, each segment is extended 
by periodicity at the right to a length closest to the nearest power of 2. The Gaussian kernel (K) 
was adopted in our analysis. Again, preliminary simulations show that our analysis is robust with 
respect to kernels with unbounded support (e.g., Laplace). The bandwidth (h n ) was chosen by the 
leave-one out cross-validation for times-series data as suggested by Hart (1996). For the associated 
95% resampling-based pointwise prediction intervals, the number of resampling samples (B) was 
taken equal to 500. 

We compare the resulting predictions with those obtained by some well-established methods in 
the literature, in particular with a smoothing spline (SS) method and with the classical SARIMA 
model. The SS method, introduced by Besse & Cardot (1996), assumes an ARH(l) structure 
for the time series Z = (Zf, i <G N + ) and handles the discretization problem of the observed 
curves by simultaneously estimating the sample paths and projecting the data on a (/-dimensional 
subspace (that the predictable part of Z assumed to belong) using smoothing splines (by solving an 
appropriate variational problem). The corresponding smoothing parameter (A) and dimensionality 
(q) are chosen by a cross-validation criterion. Following the Box-Jenkins methodology (see Box 
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& Jenkins, 1976, Chapter 9), a suitable SARIMA model is also adjusted to the times series 
Z = (Z f , i G N+). 

The quality of the prediction methods was measured by the relative mean-absolute error 
(RMAE) defined by 

where Z no is the no-th element of the time series Z and Z no is the prediction of Z no given the past. 

The computational algorithms related to wavelet analysis were performed using Version 8.02 
of the freeware WaveLab software. The entire numerical study was carried out using the Matlab 
programming environment. 

5.1 El Nino-Southern Oscillation 

This application concerns with the prediction of a climatological times series describing El Niho- 
Southern Oscillation (ENSO) during the 12-month period of 1986, from monthly observations 
during the 1950-1985 period. ENSO is a natural phenomenon arising from coupled interactions 
between the atmosphere and the ocean in the tropical Pacific Ocean. El Nino (EN) is the ocean 
component of ENSO while Southern Oscillation (SO) is the atmospheric counterpart of ENSO. 
Most of the year-to-year variability in the tropics, as well as a part of the extra-tropical variability 
over both Hemispheres, is related to ENSO. For a detailed review of ENSO the reader is referred, 
for example, to Philander (1990). 

An useful index of El Nino variability is provided by the sea surface temperatures averaged 
over the Niho-3 domain (5°S — 5°N, 150°W — 90°W). Monthly mean values have been obtained 
from January 1950 to December 1996 from gridded analyses made at the U.S. National Centers 
for Environmental Prediction (see Smith, Reynolds, Livezey &i Stokes, 1996). The time series of 
this EN index is depicted in Figure 15.11 and shows marked inter-annual variations superimposed 
on a strong seasonal component. It has been analyzed by many authors (see, for example, Besse, 
Cardot & Stephenson, 2000; Antoniadis &; Sapatinas, 2003). 

The bandwidth (h n ) for the W-K method was chosen by a cross-validation criterion and 
found equal to 0.11. We have compared our results with those obtained by Besse, Cardot & 
Stephenson (2000), using the SS method, with smoothing parameter (A) and dimensionality (q) 
chosen optimally by a cross-validation criterion and found equal to 1.6 x 10 -5 and 4, respectively. 
To complete the comparison, a suitable ARIMA model, including 12 month seasonality, has also 
been adjusted to the times series from January 1950 to December 1985, and the most parsimonious 
SARIMA model, validated through a portmanteau test for serial correlation of the fitted residuals, 
was selected. 
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Figure 5.1: The monthly mean Niho-3 surface temperature index in (deg C) which provides a 
contracted description of ENSO. 

Figure E21 displays the observed data of the 37th year (1986) and its predictions obtained by the 
W-K, SS and SARIMA methods. The RMAE of each prediction method are displayed in Table 1 
(we have taken no = 37 and P = 12). As observed in both the figure and the table, the W-K and 
SS estimators give almost similar predictions, both visually and in terms of RMAE. The prediction 
obtained by the SARIMA model is strongly and uniformly biased, failing thus to produce an 
adequate prediction (see Besse, Cardot & Stephenson, 2000, for an explanation). Note that, after 
May, all predictions are not very close to the true points and this difficulty in prediction is captured 
in Figure IS~3"1 which displays the corresponding 95% resampling-based pointwise prediction interval 
for the Niho-3 surface temperature during 1986, based on the corresponding prediction obtained 
by the W-K method. As observed in the figure, it becomes clear that as one moves from May 
onwards, this interval gets larger. 

5.2 Paris Electrical Load Consumption 

This application concerns with the one-week ahead prediction of Paris electrical load consumption 
from half-hour daily recordings. The short-term predictions are based on data sampled over 30 
minutes, obtained after eliminating certain components linked to weather conditions, calendar 
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Figure 5.2: The Nino-3 surface temperature during 1986 ( — ) and its various predictions using the 
W-K (---), SS (•••), and SARIMA (-•-) methods. 



Prediction Method 


RMAE 


W-K 
SS 
SARIMA 


0.86 % 
0.76 % 
3.72 % 



Table 1: RMAE for the prediction of Niho-3 surface temperatures during 1986 based on the W-K, 
SS and SARIMA methods. 

effects, outliers and known external actions. The dataset analyzed is part of a larger series recorded 
from the French national electricity company (EDF) during the period running from the 1st of 
August 1985 to the 4th of July 1992. The time period that we have analyzed runs 35 days, starting 
from the 24th of July 1991 to the 27th of August 1991, and it is displayed in Figure IS~H One may 
note quite a regularity in this time series and a marked periodicity of 7 days (linked to economic 
rhythms) together with a pseudo-daily periodicity. However, daily consumption patterns due to 
holidays, weekends and discounts in electricity charges (e.g., relay-switched water heaters to benefit 
from special night rates), make the use of SARIMA modelling for forecasting problematic for about 
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Figure 5.3: 95% resampling-based pointwise prediction interval (• • • ) for the Nino-3 surface 
temperature during 1986, based on the corresponding prediction obtained by the W-K ( — ) method. 
The true points (•) are also displayed. 

10% of the days when working with half-hour data (see Misiti, Misiti, Oppenheim & Poggi, 1994). 

The bandwidth (h n ) for the W-K method was chosen by cross-validation and found equal to 
0.01. We have compared our results with those obtained using the SS method, with smoothing 
parameter (A) and dimensionality (q) chosen by cross-validation and found equal to 10 -4 and 4, 
respectively. Figure 15.51 displays the observed data of the 2th August 1991 and its predictions 
obtained only by the W-K and SS methods. The RMAE of both prediction methods are displayed 
in Table 2 (we have taken no = 35 and P = 48). As observed in both the figure and the table, the 
prediction obtained by the W-K method is reasonably close to the true points, while the prediction 
made by the SS method falls far off from them. This example, clearly illustrates the impact of the 
proposed wavelet-based approach since the trajectory to be predicted seems not regular with some 
peculiar peaks. On the other hand, the smoothing spline-based approach relies upon more stringent 
smoothness assumptions and the corresponding prediction is therefore largely biased, falling well 
off the true points. Figure 15.61 displays the 95% resampling-based pointwise prediction interval 
for the half-hour electricity load consumption in Paris during 27th of August 1991, based on the 
corresponding prediction obtained by the W-K method. 
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Figure 5.4: The half-hour electricity load consumption in Paris from the 24th of July 1991 to the 
27th of August 1991. 



Prediction Method 


RMAE 


W-K 

ss 


12 % 

36 % 



Table 2: RMAE for the prediction of half-hour electricity load consumption in Paris during 27th 
of August 1991 based on the W-K and SS methods. 

5.3 Nottingham Temperature Data 

This application concerns with the one-year ahead prediction of the Nottingham temperature 
data from monthly recordings. The dataset analyzed are mean monthly air temperatures (°F) 
at Nottingham castle from January 1920 to December 1939, from 'Meteorology of Nottingham', in 
City Engineer Surveyor. Since February 1929 was an exceptionally cold month in England, and 
since the seasonal pattern is fairly stable over time, the original dataset has been 'corrected'. In 
other words, since this 'outlier' will distort the fitting process, the value of February 1929 was 
altered it to a low value for February of 35°F (see Venables & Ripley, 1999, Chapter 13). This 
'corrected' dataset, which is the series nottem in the MASS library of S-PLUS, is the one analyzed 
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Figure 5.5: The half-hour electricity load consumption in Paris during 27th of August 1991 ( — ) 
and its various predictions using the W-K ( — ) and SS (• • • ) methods. 

below. 

The bandwidth (h n ) for the W-K method was chosen by cross-validation and found equal to 2.5. 
We have compared our results with those obtained using the SS method, with smoothing parameter 
(A) and dimensionality (q) chosen by cross-validation and found equal to 10~ 4 and 1, respectively. 
To complete the comparison, a suitable ARIMA model, including 12 month seasonality, has also 
been adjusted to the times series from January 1920 to December 1938, and the most parsimonious 
SARIMA model, validated through a portmanteau test for serial correlation of the fitted residuals, 
was selected. Figure 15.81 displays the observed data of the year 1939 and its various predictions 
obtained by the W-K, SS and SARIMA methods. It is well-known in the literature (see, e.g., 
Venables & Ripley, 1999, Chapter 13) that this time series can be adequately predicted by a 
parametric model (SARIMA), but it is evident that both nonparametric models (W-K and SS) 
perform equally-well. The RMAE of each prediction method is displayed in Table 3 (we have 
taken no = 20 and P = 12). As observed in both the figure and the table, before March and after 
October, all predictions are largely biased and hence not very close to the true points. This difficulty 
in prediction is captured in Figure 15.61 which displays the corresponding 95% resampling-based 
pointwise prediction interval for the mean monthly air temperatures (°F) at Nottingham castle 
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Figure 5.6: 95% resampling-based pointwise prediction interval ( — ) for the half-hour electricity 
load consumption in Paris during 27th of August 1991, based on the corresponding prediction 
obtained by the W-K ( — ) method. The true points (•) are also displayed. 



Prediction Method 


RMAE 


W-K 

ss 

SARIMA 


30 % 
28 % 

31 % 



Table 3: RMAE for the prediction of half-hour electricity load consumption in Paris during 27th 
of August 1991 based on the W-K and SS methods. 

during 1939, based on the corresponding prediction obtained by the W-K method. As observed in 
the figure, it becomes clear that as one moves from March backwards and from October onwards, 
this interval gets larger. 

6 Conclusions 

The functional wavelet-kernel prediction methodology, of a continuous-time stochastic process on 
an entire time-interval in terms of its recent past, developed in this paper exhibits very good 
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Figure 5.7: The mean monthly air temperatures (°F) at Nottingham castle from January 1920 to 
December 1939. 

performance with respect to other well-known parametric and nonparametric techniques. As it is 
demonstrated in the real-life datasets analyzed, the proposed wavelet-based prediction methodology 
outperforms the smoothing spline-based prediction methodology for stochastic processes with 
inhomogeneous sample paths, and performs equally-well for stochastic processes with quite regular 
sample paths. Moreover, it performs reasonably well in situations where the classical seasonal 
parametric model exhibits very good predictions. We have, however, noted than when the number 
of sampling points within each time-segment is large, the curse of dimensionality leads to inefficiency 
of the proposed nonparametric prediction method unless the number of segments is really large. 
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Predicted Time Series (1939) 




Figure 5.8: The mean monthly air temperatures (°F) at Nottingham castle during 1939 ( — ) and 
its various predictions using the W-K ( — ), SS (•■•), and SARIMA ( — ) methods. 

Appendix 

Our asymptotic results will be based on the following set of assumptions, which we detail below 
before proceeding to the proofs. 



Main Assumptions 

We first impose an assumption on the sample paths of the underlying stochastic process. 

Assumption (Al): The sample paths of the strictly stationary process Z = [Z{\ i € N + ) are 
assumed to lie within a Besov space Bp q , where < s < r, 1 < p, q < oo, and r is the regularity of 
the scaling functions associated with the regular multiresolution analysis discussed in Section f2. 31 

Assumption (A2): When we only observe a fixed number P of samples values from each 
sample path, we assume that the sampled paths of the strictly stationary process Z = [Zi\ i E N + ) 
are continuous on [0,(5), and that the interpolating scaling function eft of the wavelet interpolating 
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Figure 5.9: 95% resampling-based pointwise prediction interval ( — ) for the mean monthly air 
temperatures (°F) at Nottingham castle during 1939, based on the corresponding prediction 
obtained by the W-K ( — ) method. The true points (•) are also displayed. 

basis has an exponential decay. 

Assumption (A3): The a^-mixing coefficient of the strictly stationary process Z = (Zf, i G 
N + ) satisfies 

oo 

azim) 1 - 2 / 1 = OiN- 1 ) for some I > 4. (5) 

m=N 

(Note that the scaling coefficients of Z inherit the above mixing property, according to the relevant 
discussion in Section EOl ) 

We next impose some assumptions on the joint and conditional probability density functions of 
the scaling coefficients tiff'^ ■ 

Assumption (A4): E\£?f' k \ l < oo, for I > 4 and every k = 0, 1, . . . , 2 J — 1. 

Assumption (A5): The joint probability density function f jj,k) jj,k) of (of'f , £• exist, 
it is absolute continuous with respect to Lebesgue measure, and it satisfies the following conditions 
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(*) fM,k) is Lipschitz continuous, i.e., 



fMk) Aj,k)(x 1 ,x 2 ) -f e (j,k) Mk)(yi,y 2 ) <C\\(xi,x 2 ) - {yi,y 2 )\\- 




(ii) The random vector of the scaling coefficients at scale J admits a compactly supported 
probability density function / (with support S) which is strictly positive and twice 
continuously differentiable. 

(Hi) The conditional probability density function of Q+i^ given ^ is bounded, i.e., 
fjj,k) ,Mk)(- | x) < C < oo. 



We also impose some conditions on the kernel function and the bandwidth associated with it. 

Assumption (A6): The (univariate) kernel K is a bounded symmetric density on R satisfying 
K(x) — K(y)\ < C \x — y\ for all x,y € R. Furthermore, J xK(x)dx = and f x 2 K(x)dx < oo. 



Let us now explain the meaning of the above assumptions. Assumptions (Al) (or (A2)) and 
(A3) are quite common in times series prediction (see Bosq, 1998). Assumptions (A4)-(A5) are 
essentially made on the distributional behavior of the scaling coefficients and, therefore, are less 
restrictive. They are moreover natural in nonparametric regression. Assumption (A5)-(ii) is natural 
as far as it concerns the scaling coefficients since the decay of the scaling coefficients is ensured by 
the approximation properties of the corresponding transform. Moreover, it is needed for obtained 
uniform consistency results. The conditions (A6)-(A7) are classical for kernel regression estimation. 

Proof of Theorem 13.11 The difference in the two assertions of the theorem is due to the nature 
of the observations. In case (i), each observed segment is a time series with fixed, finite length, and 
we use an interpolating wavelet transform at the appropriate resolution J that makes interpolation 
error negligible, i.e., 



In case (ii), the observed segments are continuous-time stochastic processes and one can not neglect 
anymore the approximation error due to the projection of the observed segment onto the scaling 
space Vj. However, under our assumptions, and according to the results recalled in Section 2, this 
error is uniformly bounded by a constant times 2~ sJ , where s denotes the Besov smoothness index 
s, i.e., 



Assumption (A7): The bandwidth h satisfies h — ► and nh 



co as n 



E(Vj(Z n+1 ) \Z n )= E(Z n+1 ) \Z n ). 




(6) 
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resulting in a different rate for the second case. Hence, in both cases, we proceed by deriving the 
appropriate rates for 

\K +l \ n -HTj{Z n+l )\Z n )\\. 

We first show that, as n — ► oo, 

||S re+1 | n - E(H n+ i | H n )|| -> 0, almost surely. (7) 
For this, it suffices to show that, for every k = 0, 1, . . . , 2 — 1, as n —>■ oo, 

Let x 6 M 2 " 7 , let H n+1 | n (x) be the value of H n+1 | n in @ for H n = x, and denote by C^+i^C^) 
the fc-th component of H n+ i| n (x). Consider the 2^-dimensional random variable W\ = C(H;), and 
denote by fjj,k) w and fw l the joint and marginal densities of (^ff'j^jWj) and Wj, respectively. 
Notice that because of (A4), and the fact that Wi is a linear transformation of 3/, fjj,h) w and fw t 

il + l W l 

exist with respect to Lebesgue measure for every k = 0, 1, . . . , 2 J — 1. Let 

n-l 

f Wl (x) = (nhl')- 1 K(D(x,W m )/h n ) 



m=l 



and notice that * s the kernel estimator of the 2 -dimensional density fwi(x)- For 

notational convenience, in what follows, let <I> nj fc(x) = E(£„j^ | = x) and g n> fc(x) = 

We then have 

(8) 

Using now the assumptions of the theorem, the above decomposition, Lemma 2.1 and Theorem 3.2 
of Bosq (1998), it follows that, as n — > oo, 

l/(2+2 J ) N 



^ 1 ) |n (x)-E(eif 1 ) |d J ' fc) =x)|=0^(^) j, almost surely. (9) 

Recalling now that our estimator is defined as 



sup 



2 J -1 

Z i+l\n( t ) = Yl Ct-l|nto 
k=0 

using the rate given in expression @, and the fact that we have used a regular multiresolution 
analysis, we have, as n — ► oo, 

2 J -1 

sup\zJ(t)-E{Pj(Z n+1 (t))\Z n )\ = 2 J / 2 max 
t k 



d\n - E(e&? I sup £ 10(2^ - 

* fc=0 

/ . /Wn\ 1/(2+2 ' J) \ 
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The above bound (|1(J[). together with inequality ensures the validity of both the assertions (i) 
and (ii). This completes the proof of Theorem 13.11 □ 

Proof of Theorem 14.11 For every ij G {t\,t2, ■ ■ ■ , tp}, note that Z n+ i(ti) = £^44. Since 



— Z n+l{ti) - E(Z n+ i{ti) I Z n ) + (Z n+1 | n (ij) - ^(Zn+i^i) I Z n )), 

and, as in 00, |.E(Z n+ i(ij) \Z n ) — Z^ +1 ^ n (ti)\ — > in probability, it suffices to show that the 
distribution of Z* +1 (ti) — E(Z n -\-\(tj) \Z n ) approximates correctly the conditional distribution of 
Z n+ i(ti) - E(Z n+ i(ti) I Z n ) given Z n . 

Now, given Z n = x, i.e., given H n = x, we have 

n-l 

F(Z* +1 (ti) - E(Z n+ i{ti) I Z n = x) < y) = ^2 1 (-oo,y]( Z m+i(U) ~ E(Z n+1 (ti) \ Z n = x))w n , m 

m=l 

n-l 

= ^2 ^(-oo,y](Zm+l(ti))w nt m 
m=l 
n—1 

m=l 

Emill(-oo,a(d+l)^(^(C(g).C(5rn))/fcn) 

n^ + Yr-XK{DiC{xlC{^ m ))/h n ) 
+0(n" 1 ), 

where y = y + E(Z n+ \(ti) \Z n = x). Note that (|11|) is a kernel estimator of the conditional mean 
^(l(-oo,$(£ n +i) I s n = 5?) = ^(Ci+i < 2/ 1 = x), i.e., of the conditional distribution of Q"^ given 
that H n = x. Denote now the conditional distribution of given E n by ,„ (• | H n ) and its 



?n + l I 

kernel estimator given in (|11|) by F (• | H n ). Then by the same arguments as in Theorem 13. II 

we get that, for every y £ R, as n — > 00 



sup 



^WO 1 o (y I a;) - F f (J,i) , n (y I ^ 



0, in probability. 



It remains to show that the above convergence is also uniformly over y. Fix now a x in the 
support S of H n , and let e > arbitrary. Since Fjj,i) „ (y | x) is continuous we have that, for every 

?n + l I = " 

/c G N, points —00 = yo < y\ < . . . < y^\ < yk = °o exist such that F ' (yi\x) = i/k. For 

?n+l I 

yi-i < y < yi) an d using the monotonicity of F . „ and F . „ , we have 

Fjja 1 c (y,-i I x) - F jj,i) (yi\x) < F w) (y\x) - F (JA (y\x) 



< F 



1 - (Vi\x)- F (yi\x). 
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/From this, we get 



F AJA in (y\ x )- F AJ-*) 1 77 (y\ x ) 



< sup 



1 



and, therefore, 



F f W) i = (y I x ) ~ F Mi) I n (2/ 1 x ) 



> e < P sup 



^(•V) , n (Vi\x)~ F M) | „ (l/i | 
? n+ l I -n ? n+ i I — n 



+ > e 



< 



sup sup 

i x 



F Mi) | - (Vi\x)- F ' (./,,) (#| x) 



+ AT 1 > e 



Now, chose k large enough such that 1/k < e/2. For such a fixed &;, and because, for every y € R, 
as n — ► oo, 



sup 



f aja in (y k) - ^c^.i) ,n (y I x ) 



0, in probability, 



we can choose n large enough such that 



sup sup 

Ki<k x 



F fW) ,n (y» I s ) - i n (y« I s ) 

Sn+1 I " n ?n + l I " n 



> e/2 < r, 



for any desired r. Since r is independent on y and x, the desired convergence follows. This 
completes the proof of Theorem 14.11 □ 
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